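Although pre-trained language models (LMs) have brought great improvements to many NLP tasks, there is increasing attention on exploring the capabilities of LMs and interpreting their predictions. However, existing works usually focus only on a specific capability in some downstream tasks. There is a lack of datasets for directly evaluating the masked word prediction performance and the interpretability of pre-trained LMs. To fill this gap, we propose a novel evaluation benchmark that provides annotated data in both English and Chinese. It tests LMs' abilities along multiple dimensions, i.e., grammar, semantics, knowledge, reasoning, and computation. In addition, it provides carefully annotated token-level rationales that satisfy sufficiency and compactness. It contains perturbed instances for each original instance, so that rationale consistency under perturbation can be used as a metric for faithfulness, one perspective on interpretability. We conduct experiments on several widely used pre-trained LMs. The results show that they perform poorly on the dimensions of knowledge and computation. Their plausibility in all dimensions is also far from satisfactory, especially when the rationale is short. Moreover, the pre-trained LMs we evaluated are not robust on syntax-aware data. We will release this evaluation benchmark at \url{http://xyz}, and hope it can facilitate research progress on pre-trained LMs.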
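WiFi sensing has been evolving rapidly in recent years. Empowered by propagation models and deep learning methods, many challenging applications have been realized, such as WiFi-based human activity recognition and gesture recognition. However, in contrast to deep learning for visual recognition and natural language processing, no sufficiently comprehensive public benchmark exists. In this paper, we highlight recent progress on deep-learning-enabled WiFi sensing, and then propose a benchmark, SensenFI, to study the effectiveness of various deep learning models for WiFi sensing. These advanced models are compared in terms of distinct sensing tasks, WiFi platforms, recognition accuracy, model size, computational complexity, feature transferability, and adaptability to unsupervised learning. The work is also intended as a tutorial for deep-learning-based WiFi sensing, from CSI hardware platforms to sensing algorithms. Extensive experiments provide us with experience in deep model design, learning strategies, and training techniques. To the best of our knowledge, this is the first benchmark with an open-source library for deep learning in WiFi sensing research. The benchmark code is available at https://github.com/chenxinyan-sg/wifi-csi-sensing-benchmark.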
WiFi sensing technology has shown superiority in smart homes among various sensors for its cost-effective and privacy-preserving merits. It is empowered by Channel State Information (CSI) extracted from WiFi signals and advanced machine learning models to analyze motion patterns in CSI. Many learning-based models have been proposed for various applications, but they suffer severely from environmental dependency. Though domain adaptation methods have been proposed to tackle this issue, it is not practical to collect high-quality, well-segmented, and balanced CSI samples in a new environment for adaptation algorithms, whereas randomly captured CSI samples can be easily collected. In this paper, we first explore how to learn a robust model from these low-quality CSI samples, and propose AutoFi, an annotation-efficient WiFi sensing model based on a novel geometric self-supervised learning algorithm. AutoFi fully utilizes unlabeled low-quality CSI samples that are captured randomly, and then transfers the knowledge to specific tasks defined by users, which makes it the first work to achieve cross-task transfer in WiFi sensing. AutoFi is implemented on a pair of Atheros WiFi APs for evaluation. It transfers knowledge from randomly collected CSI samples to human gait recognition and achieves state-of-the-art performance. Furthermore, we simulate cross-task transfer using public datasets to further demonstrate its capacity for cross-task learning. For the UT-HAR and Widar datasets, AutoFi achieves satisfactory results on activity recognition and gesture recognition without any prior training. We believe that AutoFi takes a huge step toward automatic WiFi sensing without any developer engagement.
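WiFi technology has been deployed in a wide range of places due to the increasing demand for high-speed Internet access. Recently, beyond network services, WiFi sensing has become appealing in smart homes because it is device-free, cost-effective, and privacy-preserving. Though numerous WiFi sensing methods have been developed, most of them consider only a single smart home scenario. Without the connection of powerful cloud servers and massive numbers of users, large-scale WiFi sensing remains difficult. In this paper, we first analyze and summarize these obstacles, and propose an efficient large-scale WiFi sensing framework, namely EfficityFi. EfficityFi works with edge computing at WiFi APs and cloud computing at center servers. It consists of a novel deep neural network that can compress fine-grained WiFi Channel State Information (CSI) at the edge, restore the CSI in the cloud, and perform sensing tasks simultaneously. A quantized auto-encoder and a joint classifier are designed to achieve these goals in an end-to-end fashion. To the best of our knowledge, EfficityFi is the first IoT-cloud-enabled WiFi sensing framework that significantly reduces communication overhead while performing sensing tasks accurately. We use human activity recognition and identification via WiFi sensing as two case studies, and conduct extensive experiments to evaluate EfficityFi. The results show that it compresses CSI data from 1.368MB/s to 0.768kb/s with extremely low data reconstruction error and achieves over 98% accuracy for human activity recognition.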
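We study generalization bounds for noisy stochastic mini-batch iterative algorithms based on the notion of stability. Recent years have seen advances in generalization bounds for such algorithms based on stability (Mou et al., 2018; Li et al., 2020) and information-theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020; Haghifam et al., 2020). In this paper, we unify and substantially generalize stability-based generalization bounds and make three technical advances. First, we bound the generalization error of general noisy stochastic iterative algorithms (not necessarily gradient descent) in terms of expected (not uniform) stability. The expected stability, in turn, can be bounded by a Le Cam style divergence. Unlike many existing bounds with O(1/\sqrt{n}) dependence, such bounds have an O(1/n) sample dependence. Second, we introduce Exponential Family Langevin Dynamics (EFLD), a substantial generalization of SGLD that allows exponential family noise to be used with stochastic gradient descent (SGD). We establish data-dependent, expected-stability-based generalization bounds for general EFLD algorithms. Third, we consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over {-1,+1}. Generalization bounds for noisy sign-SGD are implied by those of EFLD, and we also establish optimization guarantees for the algorithm. Further, we present empirical results on benchmark datasets to illustrate that our bounds are non-vacuous and quantitatively sharper than existing bounds.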
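As a rough illustration of the noisy sign-SGD idea mentioned in the abstract above, the sketch below replaces the deterministic sign of each gradient coordinate with a draw from {-1, +1}. The probability P(+1) = sigmoid(2g), the toy quadratic objective, and all function names are assumptions made for this example; they are not taken from the paper, whose exact parametrization may differ.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def noisy_sign(grad, rng):
    """Sample a vector in {-1, +1}^d whose coordinates lean toward sign(grad)."""
    signs = []
    for g in grad:
        # Assumed parametrization: P(+1) = sigmoid(2 * g), so g = 0 is a fair coin.
        p_plus = sigmoid(2.0 * g)
        signs.append(1 if rng.random() < p_plus else -1)
    return signs


def noisy_sign_sgd_step(w, grad, lr, rng):
    """Move every coordinate by exactly lr, in a randomly drawn direction."""
    return [wi - lr * si for wi, si in zip(w, noisy_sign(grad, rng))]


def grad_quadratic(w):
    """Gradient of the toy objective f(w) = 0.5 * ||w||^2 (hypothetical)."""
    return list(w)


rng = random.Random(0)
w = [3.0, -2.0]
for _ in range(200):
    w = noisy_sign_sgd_step(w, grad_quadratic(w), lr=0.05, rng=rng)
```

With a large gradient magnitude the draw almost always matches the true sign, so the update behaves like plain sign-SGD; near a stationary point the direction becomes close to a coin flip.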
A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Limited by the lack of annotated data, existing works on evaluating factual consistency directly transfer the reasoning ability of models trained on other data-rich upstream tasks, such as question answering (QA) and natural language inference (NLI), without any further adaptation. As a result, they perform poorly on real generated text and are heavily biased by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first utilizes a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with the weak supervision while taking noise into consideration. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4\% absolute improvement on average over previous state-of-the-art methods on the TRUE benchmark.
Prompting large language models has enabled significant recent progress in multi-step reasoning over text. However, when applied to text generation from semi-structured data (e.g., graphs or tables), these methods typically suffer from low semantic coverage, hallucination, and logical inconsistency. We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning. MURMUR is a best-first search method that generates reasoning paths using: (1) neural and symbolic modules with specific linguistic and logical skills, (2) a grammar whose production rules define valid compositions of modules, and (3) value functions that assess the quality of each reasoning step. We conduct experiments on two diverse data-to-text generation tasks, WebNLG and LogicNLG. These tasks differ in their data representations (graphs and tables) and span multiple linguistic and logical skills. MURMUR obtains significant improvements over recent few-shot baselines such as direct prompting and chain-of-thought prompting, while also achieving performance comparable to fine-tuned GPT-2 on out-of-domain data. Moreover, human evaluation shows that MURMUR generates highly faithful and correct reasoning paths that lead to 26% more logically consistent summaries on LogicNLG, compared to direct prompting.
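The three ingredients listed in the abstract above (modules, a grammar over module compositions, and a value function) fit the generic best-first search pattern. The sketch below is only a toy instance of that pattern: the module names, the grammar table, the scores, and the averaging value function are all hypothetical and are not MURMUR's actual components.

```python
import heapq
from itertools import count

# Hypothetical grammar: which module may follow which (START begins a path).
GRAMMAR = {
    "START": ["filter", "describe"],
    "filter": ["count", "describe"],
    "count": ["describe"],
    "describe": [],
}

# Hypothetical per-module quality scores standing in for a learned value function.
SCORES = {"filter": 0.9, "count": 0.8, "describe": 0.5}


def expand(path):
    """Return all one-step extensions of a path that the grammar allows."""
    last = path[-1] if path else "START"
    return [path + (module,) for module in GRAMMAR[last]]


def value(path):
    """Score a partial reasoning path as the mean of its module scores."""
    return sum(SCORES[m] for m in path) / len(path) if path else 0.0


def is_goal(path):
    """A path is complete once it ends in the surface-realization module."""
    return bool(path) and path[-1] == "describe"


def best_first_search(start, max_steps=1000):
    """Pop the highest-valued path first; return the first complete one."""
    tie = count()  # tie-breaker so the heap never compares paths directly
    frontier = [(-value(start), next(tie), start)]
    for _ in range(max_steps):
        if not frontier:
            return None
        _, _, path = heapq.heappop(frontier)
        if is_goal(path):
            return path
        for nxt in expand(path):
            heapq.heappush(frontier, (-value(nxt), next(tie), nxt))
    return None


best_path = best_first_search(())
```

Because the search returns the first complete path it pops, the result depends entirely on how the value function ranks partial paths, which is why the quality of that function matters in this kind of design.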
Crowd counting is usually handled in a density map regression fashion, supervised via an L2 loss between the predicted density map and the ground truth. To effectively regulate models, various improved L2 loss functions have been proposed to find a better correspondence between predicted density and annotation positions. In this paper, we propose to predict the density map at one resolution but measure the density map at multiple resolutions. By maximizing the posterior probability in such a setting, we obtain a log-formed multi-resolution L2-difference loss, of which the traditional single-resolution L2 loss is a special case. We mathematically prove that it is superior to a single-resolution L2 loss. Without bells and whistles, the proposed loss substantially improves several baselines and performs favorably compared to state-of-the-art methods on four crowd counting datasets: ShanghaiTech A & B, UCF-QNRF, and JHU-Crowd++.
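To make the idea of measuring one predicted density map at several resolutions concrete, here is a minimal sketch. It assumes sum pooling to move between resolutions, an unweighted sum of log terms, and a small `eps` for numerical safety; these choices and all function names are illustrative assumptions, and the paper's exact loss may be weighted or normalized differently.

```python
import math


def sum_pool(density, k):
    """Sum non-overlapping k x k blocks (height and width must be divisible by k)."""
    h, w = len(density), len(density[0])
    return [
        [
            sum(density[i + di][j + dj] for di in range(k) for dj in range(k))
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]


def sq_l2(a, b):
    """Squared L2 difference between two equally sized 2D maps."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def multi_resolution_log_l2(pred, target, factors=(1, 2, 4), eps=1e-6):
    """Sum of log squared-L2 differences measured at several pooled resolutions."""
    return sum(
        math.log(sq_l2(sum_pool(pred, k), sum_pool(target, k)) + eps)
        for k in factors
    )


# Toy 4 x 4 maps: one person, predicted one pixel away from the annotation.
target = [[0.0] * 4 for _ in range(4)]
pred = [[0.0] * 4 for _ in range(4)]
target[0][0] = 1.0
pred[0][1] = 1.0

fine_error = sq_l2(pred, target)
coarse_error = sq_l2(sum_pool(pred, 2), sum_pool(target, 2))
loss = multi_resolution_log_l2(pred, target)
```

In this toy case the one-pixel shift is penalized at full resolution but vanishes after 2 x 2 pooling, which illustrates why coarser measurements can be more tolerant of small localization offsets. With `factors=(1,)` the function keeps only the single-resolution term.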
Crowd localization aims to predict the spatial positions of humans in a crowd scenario. We observe that the performance of existing methods is challenged in two respects: (i) ranking inconsistency between the test and training phases; and (ii) a fixed anchor resolution that may underfit or overfit the crowd densities of local regions. To address these problems, we design a supervision target reassignment strategy for training to reduce ranking inconsistency, and propose an anchor pyramid scheme to adaptively determine the anchor density in each image region. Extensive experimental results on three widely adopted datasets (ShanghaiTech A&B, JHU-CROWD++, UCF-QNRF) demonstrate favorable performance against several state-of-the-art methods.
Information-seeking users often pose questions with false presuppositions, especially when asking about unfamiliar topics. Most existing question answering (QA) datasets, in contrast, assume all questions have well-defined answers. We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections. Through extensive baseline experiments, we show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct. This is in large part due to the difficulty of retrieving relevant evidence passages from a large text corpus. CREPE provides a benchmark for studying question answering in the wild, and our analyses provide avenues for future work on better modeling and further study of the task.